Enable session roaming across multiple server instances #1519
Add session roaming support to StreamableHTTPSessionManager, allowing
sessions to move freely between server instances without requiring
sticky sessions. This enables true horizontal scaling and high
availability for stateful MCP servers.
When a request arrives with a session ID not found in local memory,
the presence of an EventStore allows creating a transport for that
session. EventStore serves dual purposes: storing events (existing)
and proving session existence (new). This eliminates the need for
separate session validation storage.
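The dual-purpose check described above can be sketched as follows. This is an illustrative simplification, not the SDK's actual code; the names `SessionManagerSketch`, `can_accept`, and `_server_instances` are stand-ins modeled on the PR description.

```python
# Sketch of the session-roaming decision: a session is serviceable if it was
# created locally, or if a shared EventStore exists that can prove it existed
# on another instance. Names are illustrative, not the SDK's real API.
class SessionManagerSketch:
    def __init__(self, event_store=None):
        self.event_store = event_store      # shared store; doubles as existence proof
        self._server_instances = {}         # session_id -> transport (local memory)

    def can_accept(self, session_id: str) -> bool:
        if session_id in self._server_instances:
            return True                     # session was created on this instance
        # Roaming: with an EventStore configured, a transport can be
        # recreated for a session that originated elsewhere.
        return self.event_store is not None
```

Without an EventStore the check degrades to the old behavior (local sessions only), which is why the change is backward compatible.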
Changes:
- Add session roaming logic in _handle_stateful_request()
- Extract duplicate server task code into reusable methods
- Update docstrings to document session roaming capability
- Add 8 comprehensive tests for session roaming scenarios
- Add production-ready example with Redis EventStore
- Include Kubernetes and Docker Compose deployment examples
Benefits:
- One store instead of two (EventStore serves both purposes)
- No new APIs or interfaces required
- Minimal code changes (~50 lines in manager)
- 100% backward compatible
- Enables multi-instance deployments without sticky sessions
Example usage:
```python
event_store = RedisEventStore(redis_url="redis://redis:6379")
manager = StreamableHTTPSessionManager(
    app=app,
    event_store=event_store,  # enables session roaming
)
```
Github-Issue: modelcontextprotocol#520
Github-Issue: modelcontextprotocol#692
Github-Issue: modelcontextprotocol#880
Github-Issue: modelcontextprotocol#1350
Change single quotes to double quotes to comply with prettier formatting requirements.
- Add language specifiers to all code blocks
- Fix heading hierarchy (bold text to proper headings)
- Add blank lines after headings for better readability
- Escape underscores in file paths (`__init__.py` -> `\_\_init\_\_.py`)
The transport could be removed from _server_instances by the cleanup task if it crashed immediately after being started. This caused a KeyError when trying to access it from the dictionary. Fixed by keeping a local reference to the transport instead of looking it up again from the dictionary after starting the server task.
Use the @contextlib.asynccontextmanager decorator instead of a manual __aenter__/__aexit__ implementation for the mock_connect functions. Fixes test failures in:
- test_transport_server_task_cleanup_on_exception
- test_transport_server_task_no_cleanup_on_terminated
Add AsyncIterator import and use proper return type annotation for mock_connect functions: AsyncIterator[tuple[AsyncMock, AsyncMock]] instead of Any.
The tests were failing because AsyncMock(return_value=None) caused app.run to complete immediately, which closed the transport streams and triggered cleanup that removed transports from _server_instances before assertions could check for them. Now using mock_app_run that calls anyio.sleep_forever() and blocks until the test context cancels it. This keeps transports alive during the test assertions.
LGTM 👍 There are a few extra .md files, but the logic looks sound.
Do we need this file? It's nice information, but none of the other examples have it.
Most of this information seems to be in README.md
I'm on the fence about this script. None of the examples have bash scripts, but it's a nice DX sanity check.
Motivation and Context
Problem
When deploying MCP servers across multiple instances (Kubernetes pods, Docker containers, worker processes), sessions are tied to the specific instance that created them. This requires sticky sessions at the load balancer level and prevents true horizontal scaling, forcing users into unsatisfactory workarounds.
This limitation is documented in multiple issues: #520 (multi-worker sessions), #692 (session reuse across instances), #880 (horizontal scalability), and #1350 (sticky session problems).
Solution
This PR enables session roaming - allowing sessions to seamlessly move between server instances without requiring sticky sessions. The key insight is that EventStore already serves as proof of session existence.
When a request arrives with a session ID that is not in an instance's local memory and an EventStore is configured, the instance can safely create a transport for that session, treating the stored events as proof that the session exists.
What Changed
- Modified streamable_http_manager.py (~50 lines): session roaming logic in _handle_stateful_request()
- Added comprehensive tests (test_session_roaming.py, 510 lines)
- Added a production-ready example (simple-streamablehttp-roaming/, 13 files)
Why This Approach
Previous Attempts
We explored two other approaches before arriving at this solution:
Custom Session Store (outside the SDK) - Implemented session validation in the application layer. This didn't solve the core problem: every user would have to build their own solution, and because the SDK's internal session dictionary was unchanged, sticky sessions were still required.
SessionStore ABC (in the SDK) - Added a new SessionStore interface requiring both EventStore and SessionStore parameters. While functional, this approach required two separate storage backends and was more complex than necessary. It also meant that omitting either store left the server effectively stateless.
Current Approach: EventStore-Only
The key insight: EventStore already proves sessions existed. If events exist for a session ID, that session must have existed to create those events. No separate SessionStore needed.
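A minimal in-memory sketch of this insight, assuming simplified method names (`store_event`, `session_exists`) that stand in for the SDK's actual EventStore interface; a production deployment would back this with Redis, as in the PR's example.

```python
# Illustrative EventStore doubling as session-existence proof: if any
# events were stored under a session ID, that session must have existed
# on some instance, so no separate SessionStore is needed.
class InMemoryEventStore:
    def __init__(self):
        self._events: dict[str, list] = {}   # session_id -> stored events

    def store_event(self, session_id: str, event) -> None:
        self._events.setdefault(session_id, []).append(event)

    def session_exists(self, session_id: str) -> bool:
        # Existence of events implies the session was created somewhere.
        return session_id in self._events
```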
Benefits:
Usage
Before (Requires Sticky Sessions)
After (No Sticky Sessions Needed)
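A hedged side-by-side sketch of the two configurations, based on the example earlier in this description. The classes are stubbed here so the snippet is self-contained; `supports_roaming` is an illustrative property, not a real SDK attribute, and only the `event_store` parameter differs between the two setups.

```python
# Stub stand-ins for the real StreamableHTTPSessionManager / RedisEventStore,
# showing that roaming is opted into purely by passing an event_store.
class RedisEventStore:
    def __init__(self, redis_url: str):
        self.redis_url = redis_url

class StreamableHTTPSessionManager:
    def __init__(self, app, event_store=None):
        self.app = app
        self.event_store = event_store

    @property
    def supports_roaming(self) -> bool:      # illustrative helper, not SDK API
        return self.event_store is not None

app = object()  # stand-in for your MCP server app

# Before: sessions pinned to the creating instance (sticky sessions required).
sticky = StreamableHTTPSessionManager(app=app)

# After: a shared EventStore lets any instance resume any session.
roaming = StreamableHTTPSessionManager(
    app=app,
    event_store=RedisEventStore(redis_url="redis://redis:6379"),
)
```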
How It Works
How Has This Been Tested?
The included example also demonstrates:
Breaking Changes
None. This is a pure behavior enhancement:
Types of changes
Checklist
Additional context
Related Issues
Closes #520, #692, #880, #1350
This implementation addresses the core limitation described in all these issues: the inability to run stateful MCP servers across multiple instances without sticky sessions.